Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Ziheng Zhang_606300061

Display machine information for reproducibility:

sessionInfo()

Load necessary libraries (you can add more as needed).

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gridExtra)

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine

Display your machine memory.

memuse::Sys.meminfo()

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

ln -s /Users/zihengzhang/Downloads/203B/203b-hw/hw2/labevents_parquet \
./labevents_pq
patient_id <- 10013310 # set patient ID

# find race for that patient
race_info <- read_csv("~/mimic/hosp/admissions.csv.gz") |>
  filter(subject_id == patient_id) |>
  select(subject_id, race)
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# find other personal info for that patient
personal_info <- read_csv("~/mimic/hosp/patients.csv.gz") |>
  filter(subject_id == patient_id) |>
  select(subject_id, gender, anchor_age) |>
  rename(age = anchor_age) |>
  left_join(race_info, by = "subject_id") |>
  distinct()
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# find top 3 diagnoses for that patient in each stay ID
diagonose_code_info <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz")
Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diagnose_info <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") |>
  filter(subject_id == patient_id) |>
  select(-icd_version) |>
  filter(seq_num <= 3) |>
  left_join(diagonose_code_info, by = "icd_code") |>
  select(-icd_version)
Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# find ADT info for that patient
ADT_info <- read_csv("~/mimic/hosp/transfers.csv.gz") |>
  filter(subject_id == patient_id) |>
  select(subject_id, hadm_id, intime, outtime, eventtype, careunit) |>
  filter(eventtype != "discharge") |>
  arrange(intime) |>
  distinct()
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# find lab info for that patient
lab_info <- open_dataset("./labevents_pq", format = "parquet") |>
  filter(subject_id == patient_id) |>
  select(subject_id, hadm_id, charttime) |>
  collect() |>
  distinct()
  
# find procedure info for that patient
procedure_code_info <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz")
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
procedure_info <- read_csv("~/mimic/hosp/procedures_icd.csv.gz") |>
  filter(subject_id == patient_id) |>
  select(subject_id, hadm_id, chartdate, icd_code) |>
  mutate(chartdate = as.POSIXct(chartdate, format = "%Y-%m-%d %H:%M:%S")) |>
  left_join(procedure_code_info, by = "icd_code") |>
  select(-icd_version) |>
  arrange(chartdate) |>
  
  # split long title and only keep the first part
  mutate(short_title = str_split(long_title, ",") %>% sapply(function(x) 
    x[1])) |> 
  distinct()
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# combine personal info into required format
title_text <- sprintf("Patient %d, %s, %d years old, %s", 
                      personal_info$subject_id, personal_info$gender, 
                      personal_info$age, tolower(personal_info$race))

top3_text <- diagnose_info |>
  slice(1:3) |>
  select(long_title) |>
  mutate(first_five_words = str_split(long_title, "\\s+") %>% 
    sapply(function(x) {
      if (length(x) < 5) {
        paste(x, collapse = " ") # in case some diagnosis has less than 5 words
        } else {
          paste(x[1:5], collapse = " ")
          }
      })) |> 
  pull(first_five_words)

subtitle_text <- sprintf("%s\n%s\n%s", top3_text[1], top3_text[2], top3_text[3])

# create the plot and y-axis
empty_plot <- ggplot() + 
  geom_blank() +
  scale_y_discrete(limits = c("Procedure", "Lab", "ADT")) +
  labs(x = "Calendar Time", title = title_text, 
       subtitle = tolower(subtitle_text)) +
  theme_bw()

ADT_plot <- empty_plot + 
  geom_segment(data = ADT_info, aes(x = intime, xend = outtime, y = "ADT", 
                                    yend = "ADT", color = careunit), 
               linewidth = ifelse(str_detect(ADT_info$careunit, "CCU") | 
                                    str_detect(ADT_info$careunit, "ICU"), 
                                  8, 3)) +
  labs(color = "Care Unit") +
  theme(legend.position = "bottom", legend.box = "vertical", 
        text = element_text(size = 8))

Lab_plot <- ADT_plot + 
  geom_point(data = lab_info, aes(x = charttime, y = "Lab"), 
             size = 5, shape = "+")

Procedure_plot <- Lab_plot + 
  geom_point(data = procedure_info, aes(x = chartdate, y = "Procedure", 
                                        shape = short_title), size = 5) +
  
  # set manually the shape of the procedure to be the total 9 procedures
  scale_shape_manual(values = 
                       seq(1, length(unique(procedure_info$short_title)))) +
  labs(shape = "Procedure", y = "") +
  guides(shape = guide_legend(nrow = 5)) +
  theme(legend.position = "bottom", legend.box = "vertical", 
        text = element_text(size = 8))

plot(Procedure_plot)

Answer: This patient has been admitted to the hospital three times. Each admission results in the top 3 diagnoses. Therefore, for the subtitle, we select one of the admission records and displayed the top 3 diagnoses for that admission. Here we select admission record with the first hadm_id, hadm_id = 21243435, and we only display the first few words of the long title because the long title is too long.  For the label of procedure, since the long title is too long, we only display the words before the first comma of the long title. The default setting of the number of labels is 6 and here we manually set the shape of the procedure to be the total 9 procedures and then all of the procedures are displayed in the plot.

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

ln -s /Users/zihengzhang/Downloads/203B/203b-hw/hw2/chartevents.parquet \
./chartevents_pq
patient_id <- 10013310

item_icu_info <- read_csv("~/mimic/icu/d_items.csv.gz") |>
  filter(itemid %in% c(220045, 220180, 220179, 220210, 223761)) |>
  select(itemid, abbreviation) |>
  distinct()
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chartevents_info <- open_dataset("./chartevents_pq", format = "parquet") |>
  filter(subject_id == patient_id) |>
  filter(itemid %in% c(220045, 220180, 220179, 220210, 223761)) |>
  select(subject_id, stay_id, itemid, charttime, valuenum) |>
  collect() |>
  left_join(item_icu_info, by = "itemid") |>
  mutate(charttime = with_tz(charttime, "UTC"))


title_text <- sprintf("Patient %d ICU stays - Vitals", 
                      chartevents_info$subject_id)

icu_plot <- ggplot(chartevents_info, aes(x = charttime, y = valuenum, 
                                         color = abbreviation)) +
  geom_point(size = 1) +
  geom_line() +
  labs(title = title_text,
       x = "",
       y = "") +
  facet_grid(rows = vars(abbreviation), cols = vars(stay_id), scales = "free") +
  theme_light() +
  theme(legend.position = "none") +
  scale_x_datetime(date_labels = "%b %d %H:%M") +
  guides(x = guide_axis(n.dodge = 2)) # to avoid overlapping of x-axis labels

plot(icu_plot)

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz")
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(icustays_tble, width = Inf)
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los
   <dttm>              <dbl>
 1 2180-07-23 23:50:47 0.410
 2 2189-06-27 20:38:27 0.498
 3 2157-11-21 22:08:00 1.12 
 4 2157-12-20 14:27:41 0.948
 5 2110-04-12 23:59:56 1.34 
 6 2131-01-20 08:27:30 9.17 
 7 2160-05-19 17:33:33 1.31 
 8 2131-03-10 18:09:21 0.859
 9 2129-08-10 17:02:38 6.18 
10 2130-09-27 22:13:41 3.89 
# ℹ 73,171 more rows

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

# calculate the number of unique subject_id
icustays_tble |> distinct(subject_id) |> count()
# A tibble: 1 × 1
      n
  <int>
1 50920
icustays_tble |> count(subject_id) 
# A tibble: 50,920 × 2
   subject_id     n
        <dbl> <int>
 1   10000032     1
 2   10000980     1
 3   10001217     2
 4   10001725     1
 5   10001884     1
 6   10002013     1
 7   10002155     3
 8   10002348     1
 9   10002428     4
10   10002430     1
# ℹ 50,910 more rows
icustays_tble |> count(subject_id) |> 
  ggplot() + 
  geom_bar(mapping = aes(x = n)) + 
  labs(title = "Number of ICU stays per subject_id") + 
  xlab("ICU Stays") + 
  ylab("Number of subject_id") + 
  theme_bw()

Answer: There are 50920 unique subject_id. A subject_id can have multiple ICU stays.The number of ICU stays per subject_id is shown in the graph above. Most subject_id have 1 ICU stay, and most of subject_id have the ICU stay less than 10 times. The distribution of the number of ICU stays per subject_id is right-skewed and the number of subject_id decreases as the number of ICU stays increases. There are also some patients who stay in the ICU for many times, but the number of these patients is very small.

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(admissions_tble, width = Inf)
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime           deathtime
        <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00 NA       
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00 NA       
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00 NA       
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00 NA       
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00 NA       
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00 NA       
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00 NA       
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00 NA       
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00 NA       
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00 NA       
   admission_type    admit_provider_id admission_location     discharge_location
   <chr>             <chr>             <chr>                  <chr>             
 1 URGENT            P874LG            TRANSFER FROM HOSPITAL HOME              
 2 EW EMER.          P09Q6Y            EMERGENCY ROOM         HOME              
 3 EW EMER.          P60CC5            EMERGENCY ROOM         HOSPICE           
 4 EW EMER.          P30KEH            EMERGENCY ROOM         HOME              
 5 EU OBSERVATION    P51VDL            EMERGENCY ROOM         <NA>              
 6 EW EMER.          P6957U            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
 7 EU OBSERVATION    P63AD6            PHYSICIAN REFERRAL     <NA>              
 8 EU OBSERVATION    P38XXV            EMERGENCY ROOM         <NA>              
 9 EU OBSERVATION    P2358X            EMERGENCY ROOM         <NA>              
10 OBSERVATION ADMIT P75S70            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
   insurance language marital_status race  edregtime          
   <chr>     <chr>    <chr>          <chr> <dttm>             
 1 Other     ENGLISH  WIDOWED        WHITE 2180-05-06 19:17:00
 2 Medicaid  ENGLISH  WIDOWED        WHITE 2180-06-26 15:54:00
 3 Medicaid  ENGLISH  WIDOWED        WHITE 2180-08-05 20:58:00
 4 Medicaid  ENGLISH  WIDOWED        WHITE 2180-07-23 05:54:00
 5 Other     ENGLISH  SINGLE         WHITE 2160-03-03 21:55:00
 6 Medicare  ENGLISH  MARRIED        WHITE 2160-11-20 20:36:00
 7 Medicare  ENGLISH  MARRIED        WHITE 2160-12-27 18:32:00
 8 Other     ENGLISH  SINGLE         WHITE 2163-09-27 16:18:00
 9 Other     ENGLISH  DIVORCED       WHITE 2181-11-14 21:51:00
10 Other     ENGLISH  DIVORCED       WHITE 2183-09-18 08:41:00
   edouttime           hospital_expire_flag
   <dttm>                             <dbl>
 1 2180-05-06 23:30:00                    0
 2 2180-06-26 21:31:00                    0
 3 2180-08-06 01:44:00                    0
 4 2180-07-23 14:00:00                    0
 5 2160-03-04 06:26:00                    0
 6 2160-11-21 03:20:00                    0
 7 2160-12-28 16:07:00                    0
 8 2163-09-28 09:04:00                    0
 9 2181-11-15 09:57:00                    0
10 2183-09-18 20:20:00                    0
# ℹ 431,221 more rows

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

admissions_tble |> distinct(subject_id) |> count()
# A tibble: 1 × 1
       n
   <int>
1 180733
admissions_tble |> count(subject_id)
# A tibble: 180,733 × 2
   subject_id     n
        <dbl> <int>
 1   10000032     4
 2   10000068     1
 3   10000084     2
 4   10000108     1
 5   10000117     2
 6   10000248     1
 7   10000280     1
 8   10000560     1
 9   10000635     1
10   10000719     1
# ℹ 180,723 more rows

Answer: There are 180733 unique subject_id. A subject_id can have multiple admissions.

admissions_tble |> count(subject_id) |> 
  ggplot() + 
  geom_histogram(mapping = aes(x = n), binwidth = 3, fill = "salmon", 
                 color = "black") + 
  labs(title = "Admissions vs Subject ID") + 
  xlab("Number of Admissions") + 
  ylab("Count") + 
  theme_bw()

Answer: The number of admissions per subject_id is shown in the graph above. Many subject_id have less than 3 admission times, and most of subject_id have the admission less than 30 times. The number of admissions per subject_id is right-skewed and the number of subject_id decreases as the number of admissions increases. There are also some patients who are admitted for many times, but the number of these patients is very small.

# extract the hour from admittime
admissions_hour <- hour(admissions_tble$admittime) |> as_tibble()
ggplot(data = admissions_hour) + 
  geom_bar(mapping = aes(x = value)) + 
  labs(title = "Distribution of Admission Hour") +
  xlab("Hour") + 
  ylab("Count") + 
  theme_bw()

Answer: The admission hour is shown in the graph above. It seems there is something unusual. The number of admissions is highest at midnight, gradually decreasing over time. This may be because there is often some emergency situation at midnight. The time interval 1AM-10AM has fewer admissions, except for a sudden increase at 7AM. This may be because many patients will choose to get up the next day and go to the hospital in the early morning. Then, as time progresses, the number of admissions gradually increases. The evening and night are the peak admission times within a day, with the overall situation being relatively uniform.

# extract the minute from admittime
admissions_minute <- minute(admissions_tble$admittime) |> as_tibble()
ggplot(data = admissions_minute) + 
  geom_bar(mapping = aes(x = value)) + 
  labs(title = "Distribution of Admission Minute") + 
  xlab("Minute") + 
  ylab("Count") + 
  theme_bw()

Answer: The admission minute is shown in the graph above. There are very clear unusual patterns. There are significant peaks in admissions at the top of the hour, 15 minutes past, 30 minutes past, and 45 minutes past the hour. This may be because the hospital’s admission system or the staff usually input admission data on the hour, 15 minutes past, 30 minutes past, and 45 minutes past the hour. At other times, the number of admissions is relatively uniformly distributed.

# calculate the length of hospital stay
admissions_length <- difftime(admissions_tble$dischtime, 
                            admissions_tble$admittime, units = "days") |> 
  as_tibble()
ggplot(admissions_length, aes(x = value)) + 
  geom_histogram(binwidth = 5, fill = "salmon", color = "black") + 
  labs(title = "Distribution of Length of Stay") + 
  xlab("Length") + 
  ylab("Count") + 
  theme_bw()
Don't know how to automatically pick scale for object of type <difftime>.
Defaulting to continuous.

Answer: The length of hospital stay is shown in the graph above. The length of hospital stay is right-skewed. Most of the patients stay in the hospital for less than 10 days, and the number of patients decreases as the length of hospital stay increases. There are also some patients who stay in the hospital for a long time, but the number of these patients is very small.

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz")
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(patients_tble, width = Inf)
# A tibble: 299,712 × 6
   subject_id gender anchor_age anchor_year anchor_year_group dod       
        <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
 1   10000032 F              52        2180 2014 - 2016       2180-09-09
 2   10000048 F              23        2126 2008 - 2010       NA        
 3   10000068 F              19        2160 2008 - 2010       NA        
 4   10000084 M              72        2160 2017 - 2019       2161-02-13
 5   10000102 F              27        2136 2008 - 2010       NA        
 6   10000108 M              25        2163 2014 - 2016       NA        
 7   10000115 M              24        2154 2017 - 2019       NA        
 8   10000117 F              48        2174 2008 - 2010       NA        
 9   10000178 F              59        2157 2017 - 2019       NA        
10   10000248 M              34        2192 2014 - 2016       NA        
# ℹ 299,702 more rows

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

ggplot(patients_tble, aes(x = gender, fill = gender)) +
  geom_bar() +
  # show the count on top of the bar
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, 
            color = "black", size = 3) +
  labs(title = "Distribution of Gender",
       x = "Gender",
       y = "Count") + 
  theme_bw()

ggplot(patients_tble, aes(x = anchor_age)) +
  geom_histogram(binwidth = 5, color = "black", alpha = 0.7) +
  labs(title = "Distribution of Anchor_Age",
       x = "Anchor_Age",
       y = "Count") + 
  theme_bw()

ggplot(patients_tble, aes(x = gender, y = anchor_age, fill = gender)) +
  geom_violin() +
  labs(title = "Anchor Age Distribution by Gender",
       x = "Gender",
       y = "Anchor Age")

Answer: The number of female patients is slightly higher than the number of male patients. 158,553 vs 141,159. The distribution of anchor_age is shown in the graph above. The number of patients is highest in the youngest age group, around 20 years old, and the number of patients firstly decreases and then increases as the age increases. There is another peak in the age group 55. And then the number of patients gradually decreases as the age increases. Next, we compare anchor_age in different gender groups. The distribution of anchor_age in different gender groups are almost the same. The largest number of age groups for both gender groups are around 20 years old and then the number of people decreases when they are 40 years old. There is a slight increase in numbers around 60. Finally, as age increases, the number of people decreases for both gender groups. In the age group 90, there is another small increase, with females exhibiting a larger increase than males.

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head
itemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

d_item_subset <- read_csv("~/mimic/hosp/d_labitems.csv.gz") |> 
  select(itemid, label) # we only need itemid and its label name
Rows: 1622 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): label, fluid, category
dbl (1): itemid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
icustays_tble_subset <- select(icustays_tble, subject_id, stay_id, intime)

labevents_tble <- open_dataset("./labevents_pq", format = "parquet") |> 
  filter(itemid %in% c(50912, 50971, 50983, 50902, 
                       50882, 51221, 51301, 50931)) |> 
  select(subject_id, itemid, valuenum, storetime) |>
  arrange(subject_id, itemid) |>
  collect() |>
  
  # filter subject_id in labevents to match with icustays_tble
  semi_join(icustays_tble_subset, by = c("subject_id")) |>
  
  # combine label names
  left_join(d_item_subset, by = c("itemid")) |>
  
  # combine intime and stay_id by subject_id
  left_join(icustays_tble_subset, by = c("subject_id")) |>
  
  # filter the measurement before the ICU stay
  filter(storetime < intime) |>
  group_by(subject_id, label, stay_id) |>
  
  # sort by storetime and get the last available measurement
  arrange(storetime, .by_group = TRUE) |>
  summarise(finalvalue = last(valuenum)) |>
  mutate(label = tolower(label)) |>
  ungroup() |>
  
  # transform the table to wide format
  pivot_wider(names_from = label, values_from = finalvalue) |>
  rename("wbc" = "white blood cells")
Warning in left_join(left_join(semi_join(collect(arrange(select(filter(open_dataset("./labevents_pq", : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 845 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
`summarise()` has grouped output by 'subject_id', 'label'. You can override
using the `.groups` argument.
print(labevents_tble, width = Inf)
# A tibble: 68,467 × 10
   subject_id  stay_id bicarbonate chloride creatinine glucose hematocrit
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>      <dbl>
 1   10000032 39553978          25       95        0.7     102       41.1
 2   10000980 39765666          21      109        2.3      89       27.3
 3   10001217 34592300          30      104        0.5      87       37.4
 4   10001217 37067082          22      108        0.6     112       38.1
 5   10001725 31205490          NA       98       NA        NA       NA  
 6   10001884 37510196          30       88        1.1     141       39.7
 7   10002013 39060235          24      102        0.9     288       34.9
 8   10002155 31090461          23       98        2.8     117       25.5
 9   10002155 32358465          26       85        1.4     133       22.4
10   10002155 33685454          24      105        1.1     138       39.7
   potassium sodium   wbc
       <dbl>  <dbl> <dbl>
 1       6.7    126   6.9
 2       3.9    144   5.3
 3       4.1    142   5.4
 4       4.2    142  15.7
 5       4.1    139  NA  
 6       4.5    130  12.2
 7       3.5    137   7.2
 8       4.9    135  17.9
 9       5.7    120   9.8
10       4.6    139   7.9
# ℹ 68,457 more rows

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link. Answer: This procedure has been done in Q1.2.

item_icu_subset <- read_csv("~/mimic/icu/d_items.csv.gz") |>
  select(itemid, label) |>
  
  # change the label format into required format
  mutate(label = str_replace_all(label, " ", "_")) |>
  mutate(label = tolower(label))
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
icustays_tble_subset <- select(icustays_tble, subject_id, stay_id, intime, 
                               outtime)

chartevents_tble <- arrow::open_dataset("./chartevents_pq", 
                                         format = "parquet") |> 
  filter(itemid %in% c(220045, 220180, 220179, 220210, 223761)) |> 
  select(subject_id, itemid, valuenum, charttime) |>
  arrange(subject_id, itemid) |>
  collect() |>
  semi_join(icustays_tble_subset, by = c("subject_id")) |>
  left_join(item_icu_subset, by = c("itemid")) |>
  left_join(icustays_tble_subset, by = c("subject_id")) |>
  filter(charttime >= intime & charttime <= outtime) |>
  group_by(subject_id, label, stay_id) |>
  
  # sort by charttime and get the first available measurement
  arrange(charttime, .by_group = TRUE) |>
  summarise(firstvalue = first(valuenum)) |>
  ungroup() |>
  pivot_wider(names_from = label, values_from = firstvalue)
Warning in left_join(left_join(semi_join(collect(arrange(select(filter(arrow::open_dataset("./chartevents_pq", : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 91 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
`summarise()` has grouped output by 'subject_id', 'label'. You can override
using the `.groups` argument.
# change the fourth and fifth column order to match the screenshot 
temp <- chartevents_tble[, 4]
chartevents_tble[, 4] <- chartevents_tble[, 5]
chartevents_tble[, 5] <- temp
temp_colname <- colnames(chartevents_tble)[4]
colnames(chartevents_tble)[4] <- colnames(chartevents_tble)[5]
colnames(chartevents_tble)[5] <- temp_colname

print(chartevents_tble, width = Inf)
# A tibble: 73,164 × 7
   subject_id  stay_id heart_rate non_invasive_blood_pressure_systolic
        <dbl>    <dbl>      <dbl>                                <dbl>
 1   10000032 39553978         91                                   84
 2   10000980 39765666         77                                  150
 3   10001217 34592300         96                                  167
 4   10001217 37067082         86                                  151
 5   10001725 31205490         55                                   73
 6   10001884 37510196         38                                  180
 7   10002013 39060235         80                                  104
 8   10002155 31090461         94                                  118
 9   10002155 32358465         98                                  109
10   10002155 33685454         68                                  126
   non_invasive_blood_pressure_diastolic respiratory_rate temperature_fahrenheit
                                   <dbl>            <dbl>                  <dbl>
 1                                    48               24                   98.7
 2                                    77               23                   98  
 3                                    95               11                   97.6
 4                                    90               18                   98.5
 5                                    56               19                   97.7
 6                                    12               10                   98.1
 7                                    70               14                   97.2
 8                                    51               18                   96.9
 9                                    65               23                   97.7
10                                    61               18                   95.9
# ℹ 73,154 more rows

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Answer: anchor_age is age at the anchor_year so we use year(intime) - anchor_year to calculate the age difference between the year of intime and anchor_year and then add it to anchor_age to get the age at intime.

mimic_icu_cohort <- icustays_tble |>
  left_join(admissions_tble, by = c("subject_id", "hadm_id")) |>
  left_join(patients_tble, by = c("subject_id")) |>
  left_join(labevents_tble, by = c("subject_id", "stay_id")) |>
  left_join(chartevents_tble, by = c("subject_id", "stay_id")) |>
  mutate(age_at_intime = anchor_age + year(intime) - anchor_year) |>
  filter(age_at_intime >= 18)
print(mimic_icu_cohort, width = Inf)
# A tibble: 73,181 × 41
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los admittime           dischtime          
   <dttm>              <dbl> <dttm>              <dttm>             
 1 2180-07-23 23:50:47 0.410 2180-07-23 12:35:00 2180-07-25 17:55:00
 2 2189-06-27 20:38:27 0.498 2189-06-27 07:38:00 2189-07-03 03:00:00
 3 2157-11-21 22:08:00 1.12  2157-11-18 22:56:00 2157-11-25 18:00:00
 4 2157-12-20 14:27:41 0.948 2157-12-18 16:58:00 2157-12-24 14:55:00
 5 2110-04-12 23:59:56 1.34  2110-04-11 15:08:00 2110-04-14 15:00:00
 6 2131-01-20 08:27:30 9.17  2131-01-07 20:39:00 2131-01-20 05:15:00
 7 2160-05-19 17:33:33 1.31  2160-05-18 07:45:00 2160-05-23 13:30:00
 8 2131-03-10 18:09:21 0.859 2131-03-09 20:33:00 2131-03-10 01:55:00
 9 2129-08-10 17:02:38 6.18  2129-08-04 12:44:00 2129-08-18 16:53:00
10 2130-09-27 22:13:41 3.89  2130-09-23 21:59:00 2130-09-29 18:55:00
   deathtime           admission_type              admit_provider_id
   <dttm>              <chr>                       <chr>            
 1 NA                  EW EMER.                    P30KEH           
 2 NA                  EW EMER.                    P30KEH           
 3 NA                  EW EMER.                    P4645A           
 4 NA                  DIRECT EMER.                P99698           
 5 NA                  EW EMER.                    P35SU0           
 6 2131-01-20 05:15:00 OBSERVATION ADMIT           P874LG           
 7 NA                  SURGICAL SAME DAY ADMISSION P47E1G           
 8 2131-03-10 21:53:00 EW EMER.                    P80515           
 9 NA                  EW EMER.                    P05HUO           
10 NA                  EW EMER.                    P3529J           
   admission_location discharge_location           insurance language
   <chr>              <chr>                        <chr>     <chr>   
 1 EMERGENCY ROOM     HOME                         Medicaid  ENGLISH 
 2 EMERGENCY ROOM     HOME HEALTH CARE             Medicare  ENGLISH 
 3 EMERGENCY ROOM     HOME HEALTH CARE             Other     ?       
 4 PHYSICIAN REFERRAL HOME HEALTH CARE             Other     ?       
 5 PACU               HOME                         Other     ENGLISH 
 6 EMERGENCY ROOM     DIED                         Medicare  ENGLISH 
 7 PHYSICIAN REFERRAL HOME HEALTH CARE             Medicare  ENGLISH 
 8 EMERGENCY ROOM     DIED                         Other     ENGLISH 
 9 PROCEDURE SITE     CHRONIC/LONG TERM ACUTE CARE Other     ENGLISH 
10 EMERGENCY ROOM     HOME HEALTH CARE             Other     ENGLISH 
   marital_status race                   edregtime           edouttime          
   <chr>          <chr>                  <dttm>              <dttm>             
 1 WIDOWED        WHITE                  2180-07-23 05:54:00 2180-07-23 14:00:00
 2 MARRIED        BLACK/AFRICAN AMERICAN 2189-06-27 06:25:00 2189-06-27 08:42:00
 3 MARRIED        WHITE                  2157-11-18 17:38:00 2157-11-19 01:24:00
 4 MARRIED        WHITE                  NA                  NA                 
 5 MARRIED        WHITE                  NA                  NA                 
 6 MARRIED        BLACK/AFRICAN AMERICAN 2131-01-07 13:36:00 2131-01-07 22:13:00
 7 SINGLE         OTHER                  NA                  NA                 
 8 MARRIED        WHITE                  2131-03-09 19:14:00 2131-03-09 21:33:00
 9 MARRIED        WHITE                  2129-08-04 11:00:00 2129-08-04 12:35:00
10 MARRIED        WHITE                  2130-09-23 19:59:00 2130-09-24 00:50:00
   hospital_expire_flag gender anchor_age anchor_year anchor_year_group
                  <dbl> <chr>       <dbl>       <dbl> <chr>            
 1                    0 F              52        2180 2014 - 2016      
 2                    0 F              73        2186 2008 - 2010      
 3                    0 F              55        2157 2011 - 2013      
 4                    0 F              55        2157 2011 - 2013      
 5                    0 F              46        2110 2011 - 2013      
 6                    1 F              68        2122 2008 - 2010      
 7                    0 F              53        2156 2008 - 2010      
 8                    1 F              80        2128 2008 - 2010      
 9                    0 F              80        2128 2008 - 2010      
10                    0 F              80        2128 2008 - 2010      
   dod        bicarbonate chloride creatinine glucose hematocrit potassium
   <date>           <dbl>    <dbl>      <dbl>   <dbl>      <dbl>     <dbl>
 1 2180-09-09          25       95        0.7     102       41.1       6.7
 2 2193-08-26          21      109        2.3      89       27.3       3.9
 3 NA                  22      108        0.6     112       38.1       4.2
 4 NA                  30      104        0.5      87       37.4       4.1
 5 NA                  NA       98       NA        NA       NA         4.1
 6 2131-01-20          30       88        1.1     141       39.7       4.5
 7 NA                  24      102        0.9     288       34.9       3.5
 8 2131-03-10          26       85        1.4     133       22.4       5.7
 9 2131-03-10          24      105        1.1     138       39.7       4.6
10 2131-03-10          23       98        2.8     117       25.5       4.9
   sodium   wbc heart_rate non_invasive_blood_pressure_systolic
    <dbl> <dbl>      <dbl>                                <dbl>
 1    126   6.9         91                                   84
 2    144   5.3         77                                  150
 3    142  15.7         86                                  151
 4    142   5.4         96                                  167
 5    139  NA           55                                   73
 6    130  12.2         38                                  180
 7    137   7.2         80                                  104
 8    120   9.8         98                                  109
 9    139   7.9         68                                  126
10    135  17.9         94                                  118
   non_invasive_blood_pressure_diastolic respiratory_rate temperature_fahrenheit
                                   <dbl>            <dbl>                  <dbl>
 1                                    48               24                   98.7
 2                                    77               23                   98  
 3                                    90               18                   98.5
 4                                    95               11                   97.6
 5                                    56               19                   97.7
 6                                    12               10                   98.1
 7                                    70               14                   97.2
 8                                    65               23                   97.7
 9                                    61               18                   95.9
10                                    51               18                   96.9
   age_at_intime
           <dbl>
 1            52
 2            76
 3            55
 4            55
 5            46
 6            77
 7            57
 8            83
 9            81
10            82
# ℹ 73,171 more rows

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

  • Length of ICU stay los vs the last available lab measurements before ICU stay

  • Length of ICU stay los vs the first vital measurements within the ICU stay

  • Length of ICU stay los vs first ICU unit

summary_by_race <- mimic_icu_cohort |>
  group_by(race) |>
  summarise(
    mean_los = mean(los),
    sd_los = sd(los),
    median_los = median(los),
    min_los = min(los),
    max_los = max(los)
  )
print(summary_by_race, width = Inf)
# A tibble: 33 × 6
   race                          mean_los sd_los median_los min_los max_los
   <chr>                            <dbl>  <dbl>      <dbl>   <dbl>   <dbl>
 1 AMERICAN INDIAN/ALASKA NATIVE     4.46   6.54       2.09 0.0879     49.9
 2 ASIAN                             3.49   4.90       1.93 0.0631     59.5
 3 ASIAN - ASIAN INDIAN              4.20   7.61       1.95 0.160      66.3
 4 ASIAN - CHINESE                   3.36   5.39       1.86 0.0206     77.7
 5 ASIAN - KOREAN                    4.23   6.35       2.08 0.211      39.4
 6 ASIAN - SOUTH EAST ASIAN          3.04   5.42       1.82 0.0784     76.1
 7 BLACK/AFRICAN                     3.82   5.90       2.02 0.0345     54.8
 8 BLACK/AFRICAN AMERICAN            3.28   4.80       1.85 0.00346    95.8
 9 BLACK/CAPE VERDEAN                3.22   4.94       1.70 0.0296     51.7
10 BLACK/CARIBBEAN ISLAND            3.61   5.13       1.95 0.0445     38.2
# ℹ 23 more rows
ggplot(summary_by_race, aes(x = race, y = mean_los)) +
  geom_bar(stat = "identity", fill = "salmon", width = 0.5) +
  labs(title = "Mean ICU Length of Stay by Race", x = "Race", 
       y = "Mean Length of Stay") +
  coord_flip() +
  theme_bw()

Comment: The mean lengths of ICU stay for all race groups are more than 2 days. The AMERICAN INDIAN/ALASKA NATIVE group has the highest mean length of stay, which is around 4.5 days, while the HISPANIC/LATINO - MEXICAN group has the lowest mean length of stay, which is more than 2.5 days.

ggplot(mimic_icu_cohort, aes(los, fill = insurance)) + 
  geom_histogram(color = "black", binwidth = 5) +
  facet_wrap(~insurance) +
  labs(title = "Length of ICU Stay vs Insurance", x = "Length of Stay") +
  theme_bw()

Comment: Overall, the length of ICU stay in all insurance types tends to be relatively short and the distributions are all right-skewed. As the length of stay increases, the frequency decreases dramatically. The number of people using other insurances is the largest and that for Medicare insurance is almost the same, just a little bit smaller in the shortest los group. The number of people using Medicaid insurance is the smallest in all lengths of stay, accounting for only about one-tenth of the total of the other two types combined.

ggplot(mimic_icu_cohort, aes(los, fill = marital_status)) + 
  geom_histogram(color = "black", binwidth = 5) +
  facet_wrap(~marital_status) +
  labs(title = "Length of ICU Stay vs Martial Status", x = "Length of Stay") +
  theme_bw()

Comment: Overall, the length of ICU stay in all marital statuses tends to be relatively short and the distributions are all right-skewed. As the length of stay increases, the frequency decreases dramatically. The number of people who are married marital status remains the largest in almost all lengths of stay. In contrast, the number of people who are divorced and widowed marital status is small.

ggplot(mimic_icu_cohort, aes(los, fill = gender)) +
  geom_histogram(color = "black", binwidth = 5) +
  facet_wrap(~gender) +
  labs(title = "Length of ICU Stay vs Gender", x = "Length of Stay") +
  theme_bw()

Comment: Overall, the length of ICU stay in all gender types tends to be relatively short and the distributions are all right-skewed. As the length of stay increases, the frequency decreases dramatically. The number of female is less than that for male remains in all length of stay groups.

ggplot(mimic_icu_cohort, aes(x = age_at_intime, y = los)) +
  geom_point(alpha = 0.3) +
  labs(title = "Length of ICU Stay vs Age at Intime", x = "Age at Intime", 
       y = "Length of Stay") +
  theme_bw()

Comment: The length of ICU stay tends to be relatively short for all ages. As the age increases, the length of stay tends to increase and then decreases. The length of stay seems to be longer for people aged 65-80.

f1 <- ggplot(mimic_icu_cohort, aes(x = bicarbonate, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

f2 <- ggplot(mimic_icu_cohort, aes(x = chloride, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

f3 <- ggplot(mimic_icu_cohort, aes(x = creatinine, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

f4 <- ggplot(mimic_icu_cohort, aes(x = glucose, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

f5 <- ggplot(mimic_icu_cohort, aes(x = hematocrit, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

f6 <- ggplot(mimic_icu_cohort, aes(x = potassium, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

f7 <- ggplot(mimic_icu_cohort, aes(x = sodium, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

f8 <- ggplot(mimic_icu_cohort, aes(x = wbc, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

f <- grid.arrange(f1, f2, f3, f4, f5, f6, f7, f8, ncol = 2, nrow = 4, 
                  top = paste("Length of ICU Stay vs", 
                              "Last Available Lab Measurements before ICU Stay",
                              collapse = " "))
Warning: Removed 9050 rows containing missing values (`geom_point()`).
Warning: Removed 8883 rows containing missing values (`geom_point()`).
Warning: Removed 5770 rows containing missing values (`geom_point()`).
Warning: Removed 9099 rows containing missing values (`geom_point()`).
Warning: Removed 5017 rows containing missing values (`geom_point()`).
Warning: Removed 8901 rows containing missing values (`geom_point()`).
Warning: Removed 8872 rows containing missing values (`geom_point()`).
Warning: Removed 5094 rows containing missing values (`geom_point()`).

Comment: Data points of creatinine, glucose, and wbc figures are relatively concentrated in regions where measurement values are small, and as these three values increase, length of stay tends to decrease. The data points of the other five figures show almost in the middle range of measurement values, and as these values increase, length of stay tends to increase and then decrease.

g1 <- ggplot(mimic_icu_cohort, aes(x = heart_rate, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

g2 <- ggplot(mimic_icu_cohort, aes(x = non_invasive_blood_pressure_systolic, 
                                   y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

g3 <- ggplot(mimic_icu_cohort, aes(x = non_invasive_blood_pressure_diastolic, 
                                   y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

g4 <- ggplot(mimic_icu_cohort, aes(x = respiratory_rate, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")

g5 <- ggplot(mimic_icu_cohort, aes(x = temperature_fahrenheit, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(y = "LOS")


g <- grid.arrange(g1, g2, g3, g4, g5, ncol = 2, 
                  top = paste("Length of ICU Stay vs", 
                              "First Vital Measurements within the ICU Stay", 
                              collapse = " "))
Warning: Removed 18 rows containing missing values (`geom_point()`).
Warning: Removed 979 rows containing missing values (`geom_point()`).
Warning: Removed 983 rows containing missing values (`geom_point()`).
Warning: Removed 98 rows containing missing values (`geom_point()`).
Warning: Removed 1362 rows containing missing values (`geom_point()`).

Comment: All of five vital measurements have some outliers and they cause influence on normal data analysis.

h1 <- ggplot(mimic_icu_cohort, aes(x = heart_rate, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  xlim(0, 250) +
  labs(y = "LOS")

h2 <- ggplot(mimic_icu_cohort, aes(x = non_invasive_blood_pressure_systolic, 
                                   y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  xlim(0, 300) +
  labs(y = "LOS")

h3 <- ggplot(mimic_icu_cohort, aes(x = non_invasive_blood_pressure_diastolic, 
                                   y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  xlim(0, 220) +
  labs(y = "LOS")

h4 <- ggplot(mimic_icu_cohort, aes(x = respiratory_rate, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  xlim(0, 100) +
  labs(y = "LOS")

h5 <- ggplot(mimic_icu_cohort, aes(x = temperature_fahrenheit, y = los)) +
  geom_point(size = 0.5, alpha = 0.3) +
  xlim(0, 110) +
  labs(y = "LOS")


h <- grid.arrange(h1, h2, h3, h4, h5, ncol = 2, 
                  top = paste("Length of ICU Stay vs", 
                              "First Vital Measurements within the ICU Stay", 
                              collapse = " "))
Warning: Removed 20 rows containing missing values (`geom_point()`).
Warning: Removed 980 rows containing missing values (`geom_point()`).
Warning: Removed 992 rows containing missing values (`geom_point()`).
Warning: Removed 104 rows containing missing values (`geom_point()`).
Warning: Removed 1363 rows containing missing values (`geom_point()`).

Comment: After limiting the range of x-axis, the figures of heart_rate, non_invasive_blood_pressure_systolic, non_invasive_blood_pressure_diastolic, and respiratory_rate have similar patterns. temperature_fahrenheit around 35 tends to have slightly short lengths of stay. Most values fall within the range of 90 to 105 Fahrenheit. Within this range, as temperature_fahrenheit increases, the length of stay tends to increase and then decrease. The longest lengths of stay occur around a temperature of approximately 98 Fahrenheit, which is the normal body temperature.

summary_by_care <- mimic_icu_cohort %>%
  group_by(first_careunit) %>%
  summarise(
    mean_los = mean(los),
    sd_los = sd(los),
    median_los = median(los),
    min_los = min(los),
    max_los = max(los)
  )
print(summary_by_care, width = Inf)
# A tibble: 9 × 6
  first_careunit                                   mean_los sd_los median_los
  <chr>                                               <dbl>  <dbl>      <dbl>
1 Cardiac Vascular Intensive Care Unit (CVICU)         3.29   4.89       1.99
2 Coronary Care Unit (CCU)                             3.19   4.00       2.01
3 Medical Intensive Care Unit (MICU)                   3.26   4.61       1.83
4 Medical/Surgical Intensive Care Unit (MICU/SICU)     3.08   4.30       1.81
5 Neuro Intermediate                                   3.42   3.73       2.34
6 Neuro Stepdown                                       2.59   2.89       1.69
7 Neuro Surgical Intensive Care Unit (Neuro SICU)      6.30   7.46       3.65
8 Surgical Intensive Care Unit (SICU)                  3.84   5.76       1.97
9 Trauma SICU (TSICU)                                  3.83   5.52       1.94
  min_los max_los
    <dbl>   <dbl>
1 0.00352   103. 
2 0.00125    66.6
3 0.00227   110. 
4 0.00145    67.9
5 0.00346    43.5
6 0.0570     29.1
7 0.0216     59.6
8 0.00745   102. 
9 0.00218    99.6
ggplot(summary_by_care, aes(x = first_careunit, y = mean_los)) +
  geom_bar(stat = "identity", fill = "salmon", width = 0.5) +
  labs(title = "Mean ICU Length of Stay by First ICU Unit", 
       x = "First Careunit", y = "Mean Length of Stay") +
  coord_flip() +
  theme_bw()

Comment: The mean lengths of ICU stay for all first_icu groups are more than 2 days. The Neuro Surgical Intensive Care Unit (Neuro SICU) group has the highest mean length of stay, which is more than 6 days, while the Neuro Stepdown group has the lowest mean length of stay, which is around 2.5 days. The remaining seven categories are relatively similar.